Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

impl std::fmt::Write for StringViewBuilder / BinaryViewBuilder #6777

Closed
wants to merge 2 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Nov 22, 2024

Which issue does this PR close?

Closes #6373

Rationale for this change

@tlm365 is trying to implement cast from various types to StringView efficiently in

Also, DataFusion had several places where it would like to write to a StringViewBuilder using standard rust formatting such as write!

What changes are included in this PR?

  1. impl std::fmt::Write for StringViewBuilder / BinaryViewBuilder
  2. Tests + documentation

Open questions / follow on PRs:

  1. Should I implement the same sort of "finish the row on drop" style API for StringBuilder

Are there any user-facing changes?

Yes, new APIs + docs

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 22, 2024
@alamb alamb marked this pull request as ready for review November 22, 2024 22:38
@tustvold
Copy link
Contributor

I think this will perform identically to just writing to a temporary String manually?

I wonder if we could do something a bit more sophisticated, but then again perhaps it won't make a difference.

I like the Drop based flush, although it does introduce a bit of API inconsistency...

@@ -88,6 +86,9 @@ pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
/// map `<string hash> -> <index to the views>`
string_tracker: Option<(HashTable<usize>, ahash::RandomState)>,
phantom: PhantomData<T>,
/// temporary space used by [`ByteViewFormatter`]
/// to avoid extra allocations
temp_buffer: Option<String>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation allocates a single String which is used for all write! calls to the same StringViewBuilder, which should keep allocations to a mininum.

However, this implementation will copy the strings twice:

  1. once into the temp buffer String
  2. once into the final view

For short strings (less than 12 bytes) this copy is likely required anyways (unless we are manipulating the view buffer directly)

For longer strings (greater than 12 bytes), we could avoid the extra copy into the String buffer by writing directly into the underlying buffer.

I actually tried to avoid the second copy, but found it was quite complex due to:

  1. Having to respect the string_tracker (deduplication)
  2. handle the block size re-allocation correctly

Thus I went with the simple approach here which should get most of the benefits, and figured we can optimize further if needed.

/// - String buffer count exceeds `u32::MAX`
/// - String length exceeds `u32::MAX`
#[inline]
pub fn append_str(&mut self, value: &str) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be a fancier way to write the generics to avoid introdicing a append_str and append_bytes API, but I couldn't find it.

///
/// Safety: caller must ensure value is a valid native type (e.g. valid
/// UTF-8 for `StringViewType`).
unsafe fn append_bytes(&mut self, v: &[u8]) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could potentially mark this as pub too but I figured I would keep the proposed API changes small

@alamb
Copy link
Contributor Author

alamb commented Nov 23, 2024

I think this will perform identically to just writing to a temporary String manually?

Yes that is my understanding -- this API just attaches the temporary String to the builder if needed

I wonder if we could do something a bit more sophisticated, but then again perhaps it won't make a difference.

I tried doing something more sophisticated (specifically building up the view incrementally, and writing to the buffer) but I gave up due to #6777 (comment) (which I forgot to submit yesterday)

I like the Drop based flush, although it does introduce a bit of API inconsistency...

Yeah, it is different than the StringBuilder API. I could add the same formatter() style API to the string builder too 🤔

///
/// Panics if
/// - String buffer count exceeds `u32::MAX`
/// - String length exceeds `u32::MAX`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You would likely run out of stack space first :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe heap space 🤔

arrow-array/src/builder/generic_bytes_view_builder.rs Outdated Show resolved Hide resolved
arrow-array/src/builder/generic_bytes_view_builder.rs Outdated Show resolved Hide resolved
arrow-array/src/builder/generic_bytes_view_builder.rs Outdated Show resolved Hide resolved
@alamb alamb marked this pull request as draft December 16, 2024 19:21
@alamb
Copy link
Contributor Author

alamb commented Dec 16, 2024

I believe it is not clear we want to pursue this particular pattern, so marking it as draft for nw

@alamb alamb closed this Jan 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement std::fmt::Write for StringViewBuilder to permit non contiguous data to be written
3 participants